The US Department of Energy has launched the Materials Project initiative, which provides open source data on thousands of materials. In this analysis, I focus on the MP Batteries dataset, which has been released as part of the project. The aim of the analysis is to present the characteristics of the batteries. The analysis focuses on 4 different aspects: distribution of attributes, correlations between attributes, characteristics depending on working ion and predictions.
The analysis is performed with the R language. I used the following packages:

- dplyr to clean up the dataset and get better control over the data,
- kableExtra, ggcorrplot, GGally and plotly to prepare visualisations,
- caret, randomForest, RRF to run regressions on the dataset.
To ensure the reproducibility of my work, I set the seed (initial state for random number generation) to 379.
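As a minimal illustration of what setting the seed does, the sketch below draws random numbers twice from the same seed and checks that the draws match. The variable names are only for illustration and are not part of the analysis.

```r
# Resetting the seed to the same value restarts the random number
# generator from the same state, so the draws are repeated exactly.
set.seed(379)
first_draw <- runif(3)

set.seed(379)
second_draw <- runif(3)

identical(first_draw, second_draw)
```

The same idea applies to every random step later in the report, such as sampling observations or splitting the data.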
The MP Batteries dataset comes from the Materials Project website. It consists of 4351 observations and 17 variables: 1 identification, 4 string type, 1 discrete and 11 continuous. There are no missing values in the dataset.
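library(dplyr)       # %>%, mutate(), n_distinct(), slice_sample()
library(kableExtra)  # kable(), kable_styling(), scroll_box()
library(ggplot2)     # ggplot(), used for the histograms and scatter plots
library(ggcorrplot)  # ggcorrplot()
library(plotly)      # ggplotly(), plot_ly()
library(caret)       # train(), varImp(), createDataPartition()
X <- read.csv("./data/mp_batteries.csv", header = TRUE, sep = ",")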
X %>%
head %>%
kable %>%
kable_styling("striped", full_width = F) %>%
kableExtra::scroll_box(width = "800px")
| Battery.ID | Battery.Formula | Working.Ion | Formula.Charge | Formula.Discharge | Max.Delta.Volume | Average.Voltage | Gravimetric.Capacity | Volumetric.Capacity | Gravimetric.Energy | Volumetric.Energy | Atomic.Fraction.Charge | Atomic.Fraction.Discharge | Stability.Charge | Stability.Discharge | Steps | Max.Voltage.Step |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mp-30_Al | Al0-2Cu | Al | Cu | Al2Cu | 3.043399 | 0.0890331 | 1368.481 | 5562.790 | 121.84009 | 495.27253 | 0.0 | 0.6666667 | 0.0000000 | 0.0000000 | 1 | 0 |
| mp-1022721_Al | Al1-3Cu | Al | AlCu | Al3Cu | 1.243653 | -0.0215863 | 1112.937 | 4418.980 | -24.02423 | -95.38962 | 0.5 | 0.7500000 | 0.0740612 | 0.0962458 | 1 | 0 |
| mp-8637_Al | Al0-5Mo | Al | Mo | Al5Mo | 4.762574 | 0.1227568 | 1741.504 | 7175.702 | 213.78156 | 880.86651 | 0.0 | 0.8333333 | 0.4114601 | 0.0452120 | 1 | 0 |
| mp-129_Al | Al0-12Mo | Al | Mo | Al12Mo | 12.723893 | 0.0431214 | 2298.811 | 7346.232 | 99.12801 | 316.78006 | 0.0 | 0.9230769 | 0.0000000 | 0.0114456 | 1 | 0 |
| mp-91_Al | Al0-12W | Al | W | Al12W | 12.494598 | 0.0292342 | 1900.745 | 7332.719 | 55.56677 | 214.36621 | 0.0 | 0.9230769 | 0.0000000 | 0.0000000 | 1 | 0 |
| mp-1055908_Al | Al0-12Mn | Al | Mn | MnAl12 | 18.236156 | 0.0397314 | 2547.693 | 7592.916 | 101.22330 | 301.67688 | 0.0 | 0.9230769 | 0.1454643 | 0.0000000 | 1 | 0 |
colSums(is.na(X))
## Battery.ID Battery.Formula Working.Ion
## 0 0 0
## Formula.Charge Formula.Discharge Max.Delta.Volume
## 0 0 0
## Average.Voltage Gravimetric.Capacity Volumetric.Capacity
## 0 0 0
## Gravimetric.Energy Volumetric.Energy Atomic.Fraction.Charge
## 0 0 0
## Atomic.Fraction.Discharge Stability.Charge Stability.Discharge
## 0 0 0
## Steps Max.Voltage.Step
## 0 0
str(X)
## 'data.frame': 4351 obs. of 17 variables:
## $ Battery.ID : chr "mp-30_Al" "mp-1022721_Al" "mp-8637_Al" "mp-129_Al" ...
## $ Battery.Formula : chr "Al0-2Cu" "Al1-3Cu" "Al0-5Mo" "Al0-12Mo" ...
## $ Working.Ion : chr "Al" "Al" "Al" "Al" ...
## $ Formula.Charge : chr "Cu" "AlCu" "Mo" "Mo" ...
## $ Formula.Discharge : chr "Al2Cu" "Al3Cu" "Al5Mo" "Al12Mo" ...
## $ Max.Delta.Volume : num 3.04 1.24 4.76 12.72 12.49 ...
## $ Average.Voltage : num 0.089 -0.0216 0.1228 0.0431 0.0292 ...
## $ Gravimetric.Capacity : num 1368 1113 1742 2299 1901 ...
## $ Volumetric.Capacity : num 5563 4419 7176 7346 7333 ...
## $ Gravimetric.Energy : num 121.8 -24 213.8 99.1 55.6 ...
## $ Volumetric.Energy : num 495.3 -95.4 880.9 316.8 214.4 ...
## $ Atomic.Fraction.Charge : num 0 0.5 0 0 0 ...
## $ Atomic.Fraction.Discharge: num 0.667 0.75 0.833 0.923 0.923 ...
## $ Stability.Charge : num 0 0.0741 0.4115 0 0 ...
## $ Stability.Discharge : num 0 0.0962 0.0452 0.0114 0 ...
## $ Steps : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Max.Voltage.Step : num 0 0 0 0 0 0 0 0 0 0 ...
for (name in colnames(X))
{
if(is.numeric(X[[name]]))
{
cat("\n")
print(name)
print(summary(X[[name]]))
}
}
##
## [1] "Max.Delta.Volume"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00002 0.01747 0.04203 0.37531 0.08595 293.19322
##
## [1] "Average.Voltage"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -7.755 2.226 3.301 3.083 4.019 54.569
##
## [1] "Gravimetric.Capacity"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.176 88.108 130.691 158.291 187.600 2557.627
##
## [1] "Volumetric.Capacity"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.08 311.62 507.03 610.62 722.75 7619.19
##
## [1] "Gravimetric.Energy"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -583.5 211.7 401.8 444.1 614.4 5926.9
##
## [1] "Volumetric.Energy"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2208.1 821.6 1463.8 1664.0 2252.3 18305.9
##
## [1] "Atomic.Fraction.Charge"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.03986 0.04762 0.90909
##
## [1] "Atomic.Fraction.Discharge"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.007407 0.086957 0.142857 0.159077 0.200000 0.993333
##
## [1] "Stability.Charge"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.03301 0.07319 0.14257 0.13160 6.48710
##
## [1] "Stability.Discharge"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01952 0.04878 0.12207 0.09299 6.27781
##
## [1] "Steps"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.167 1.000 6.000
##
## [1] "Max.Voltage.Step"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1503 0.0000 26.9607
rm(list = c("name"))
**String variables** While the ranges of the
battery formula, formula charge and
formula discharge attributes are wide compared to the
number of observations, the range of the working ion
attribute is narrow and consists of 11 values. The vast majority of the
batteries studied have Li as the working ion (the main ion
that transports electric charge). Other types of batteries included in
the dataset are calcium, magnesium, sodium and zinc. There are also
batteries with Al, Cs, K,
Rb, Y as the working ion, but the number of
observations with them in the dataset is marginal. Therefore, in the
following analysis I combine them into the category
Other.
**Continuous variables** The density plot of each
continuous attribute has a high peak and a
long tail. Among them, the most evenly distributed is the
atomic fraction discharge variable.
**Discrete variables** steps is the only discrete
attribute in the dataset. There are very few observations with more than
1 step between full charge and discharge.
for (name in colnames(X))
{
if(name == "Battery.ID" ) next
threshold <- 10
if(is.numeric(X[[name]]) && n_distinct(X[[name]]) > threshold)
{
plot(density(X[[name]]), main = name)
}else if(n_distinct(X[[name]]) <= threshold){
barplot(table(X[[name]]), main = name)
}else{
barplot(table(X[[name]]), main = paste(name, ": ", n_distinct(X[[name]]), " distinct values"), xaxt = 'n')
# print(paste(name, ": ", n_distinct(X[[name]]), " distinct values"))
}
}
rm(list = c("name", "threshold"))
To gain more insight into the reasons for such a specific
distribution of the numerical variables, I present their histograms,
divided into panels according to working ion. The panels are presented
for 6 categories: Li, Ca, Mg,
Na, Zn and Other, which combines the
rest of the observations. Note that for better visibility the x-axis is
limited to the range between the 5th and 95th percentiles of each
attribute, so values outside that range are not shown. They are filtered
out by this line of code:
xlim(quantile(mutated_X[,i], probs = c(0.05)), quantile(mutated_X[,i], probs = c(0.95))).
Because most observations are batteries with lithium as the
working ion, the overall distribution plots are dominated by lithium
batteries. It can be
observed that while the distribution for some characteristics is similar
for each battery category (max delta volume,
stability charge, stability discharge), there
are also characteristics that have different distributions for batteries
with different working ions. In the next section I analyse which
characteristics describe different battery categories.
mutated_X <- X %>%
mutate(Working.Ion.Other = ifelse(Working.Ion %in% c("Li", "Ca", "Mg", "Na", "Zn"), Working.Ion, "Other" ))
for (i in 6:17) {
p <- mutated_X %>%
ggplot(aes(x=mutated_X[,i])) +
geom_histogram(bins = 20) +
xlim(quantile(mutated_X[,i], probs = c(0.05)), quantile(mutated_X[,i], probs = c(0.95))) +
facet_grid(cols = vars(Working.Ion.Other)) +
xlab(colnames(mutated_X)[i])
print(p)
}
## Warning: Removed 436 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 436 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 435 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 436 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 436 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 436 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 184 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 430 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 216 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 218 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 99 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 218 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
rm(list = c("i", "p"))
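The warnings above follow from the percentile-based axis limits: roughly a tenth of the 4351 rows fall outside the 5th to 95th percentile range of a continuous attribute, which matches counts such as 436. The sketch below reproduces the effect on a made-up skewed vector; `toy_values` is hypothetical and is not taken from the dataset.

```r
# Made-up, right-skewed values standing in for one attribute
toy_values <- (1:100)^2

# Same limits as in the plotting code: 5th and 95th percentiles
limits <- quantile(toy_values, probs = c(0.05, 0.95))

# Values outside the limits are the ones the histogram drops
outside <- sum(toy_values < limits[1] | toy_values > limits[2])
share_outside <- outside / length(toy_values)

outside
share_outside
```

For attributes with many tied values, such as those where most observations equal 0, fewer rows lie strictly outside the limits, which is consistent with the smaller counts in some of the warnings.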
In this section I want to discover some characteristics of batteries
with respect to the working ion. Due to the overrepresentation of
lithium batteries in the dataset, I create a sample where there are 200
observations for each battery type. Note that I still use the
other category for combination of underrepresented
batteries.
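The balancing idea can be sketched in base R as follows. This is an illustration on made-up data, not the code used in the analysis; `toy` and `sample_per_group` are hypothetical names. In this sketch a group with fewer rows than requested simply contributes all of its rows; as far as I know, dplyr's `slice_sample()` truncates in the same way.

```r
# Made-up data with deliberately unbalanced groups
toy <- data.frame(
  ion   = rep(c("Li", "Ca", "Other"), times = c(50, 20, 3)),
  value = seq_len(73)
)

# Draw up to n rows from every group (all rows if the group is smaller)
sample_per_group <- function(df, group_col, n) {
  parts <- split(df, df[[group_col]])
  sampled <- lapply(parts, function(g) {
    g[sample(nrow(g), min(n, nrow(g))), , drop = FALSE]
  })
  do.call(rbind, sampled)
}

set.seed(379)
balanced <- sample_per_group(toy, "ion", n = 10)
table(balanced$ion)
```

After sampling, the two large groups have 10 rows each, while the small group keeps its 3 rows, so very rare categories remain underrepresented even in the balanced sample.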
Once again I use the histogram to show the distribution of attribute values. Each category is now represented by a different colour instead of being on a separate panel. This helps me to better see the relationship between attribute values for different types of battery. Below I present my observations regarding the working ion.
**Lithium (Li)** Lithium batteries are characterised by a high average voltage (between 3 and 5 volts). Most of them have a volumetric capacity value below 500 and a stability charge value below 1.2. The stability discharge of lithium batteries is in most cases also below 1.2.
**Calcium (Ca)** Calcium batteries reach values from a very wide range for almost all attributes, except stability charge where they reach values below 1.5 in most cases, and atomic fraction discharge where the values are also mostly below 1.5.
**Magnesium (Mg)** Similar to calcium batteries, the variance for most attributes is high for magnesium batteries. However, there are some characteristics specific to this type of battery. Their average voltage is less than 4 and their maximum delta volume is less than 0.17.
**Sodium (Na)** Sodium batteries are characterised by low volumetric (below 600) and gravimetric (below 200) capacities. They also have low values for charge stability (less than 2.0) and discharge stability (less than 1.5).
**Zinc (Zn)** Zinc batteries have an average voltage of less than 3.0. Their volumetric energy is mostly below 2500 and their gravimetric energy below 600. They also achieve low values for atomic fraction discharge and stability discharge (below 2.0).
mutated_sample_X <- mutated_X %>%
group_by(Working.Ion.Other) %>%
slice_sample(n=200)
for (i in c(6:15,17)) {
p <- mutated_sample_X %>%
ggplot(aes(x=mutated_sample_X[[colnames(mutated_sample_X)[i]]], fill=Working.Ion.Other, color=Working.Ion.Other)) +
geom_histogram(bins = 20, alpha = 0.5) +
xlim(quantile(mutated_X[[colnames(mutated_X)[i]]], probs = c(0.05)), quantile(mutated_X[[colnames(mutated_X)[i]]], probs = c(0.95))) +
ylim(0,250) +
xlab(colnames(mutated_sample_X)[i])
print(p)
}
## Warning: Removed 146 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 139 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 129 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 147 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 143 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 161 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 30 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 119 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 13 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 66 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 69 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
## Warning: Removed 37 rows containing non-finite outside the scale range (`stat_bin()`).
## Removed 12 rows containing missing values or values outside the scale range
## (`geom_bar()`).
rm(list = c("i", "p"))
My observations from the previous section can be supported by a variable importance plot. As can be seen on the graph, ‘stability discharge’, ‘volumetric energy’, ‘gravimetric energy’, ‘stability charge’ and ‘atomic fraction discharge’ are the 5 most important attributes in distinguishing batteries in terms of their working ion. All 5 attributes are also present in my conclusions from the histograms presented in the previous section.
rrfMod <- train(Working.Ion.Other ~ ., data=mutated_sample_X[,c(10:15,17:18)], method="RRF")
rrfImp <- varImp(rrfMod, scale=F)
plot(rrfImp, top = 5, main='Variable Importance')
rm(list = c("rrfImp", "rrfMod"))
The next step is to analyse linear correlations between attributes. I
only compute correlations for numerical variables. Note that I do not
present results for the variables max delta volume and
steps. According to the documentation,
max delta volume is calculated from two other variables:
stability charge and stability discharge. The
steps variable has a very low variance: most observations
have a value of steps equal to 1.
The correlation matrix below shows that
gravimetric energy and volumetric energy are
very strongly correlated with each other. Similarly,
gravimetric capacity and volumetric capacity
are very strongly correlated with each other. Slightly weaker but still
strong correlations are found between
atomic fraction discharge and
gravimetric capacity, atomic fraction discharge
and volumetric capacity, and average voltage and
gravimetric energy.
mutated_sample_X %>%
ungroup%>%
select(c(Average.Voltage:Stability.Discharge, Max.Voltage.Step)) %>%
cor() %>%
round(1) %>%
ggcorrplot(type = "lower", lab = TRUE)
Below I present 5 more correlation matrices. Each of them shows
correlation values between attributes depending on the working ion in
the battery.
**Lithium (Li)** For lithium batteries,
stability charge and stability discharge
values are very strongly correlated with each other.
gravimetric energy is strongly correlated with
gravimetric capacity. The same is true for
volumetric energy and volumetric capacity.
**Calcium (Ca)** Similar to lithium batteries, calcium batteries
also have gravimetric capacity strongly correlated with
gravimetric energy, and volumetric energy with
volumetric capacity. Interestingly, calcium batteries have
an inverse correlation between average voltage and
stability discharge. This is the only strong inverse
correlation I have observed.
**Magnesium (Mg)** Magnesium batteries have a perfect correlation
(1.0 after rounding to one decimal place) between gravimetric energy and
volumetric energy. Their
volumetric capacity and gravimetric capacity are also
very strongly associated with atomic fraction discharge.
**Sodium (Na)** As with lithium batteries,
stability charge and stability discharge are
very strongly correlated for sodium batteries.
**Zinc (Zn)** Zinc batteries have perfect correlations
(1.0 after rounding to one decimal place) between
gravimetric energy and volumetric energy,
gravimetric capacity and volumetric capacity, and
volumetric capacity and
atomic fraction discharge.
for(ion in c("Li", "Ca", "Mg", "Na", "Zn"))
{
p <- X %>%
filter(Working.Ion == ion) %>%
ungroup %>%
select(c(Average.Voltage:Stability.Discharge, Max.Voltage.Step)) %>%
cor() %>%
round(1) %>%
ggcorrplot(type = "lower", lab = TRUE) +
ggtitle(ion)
print(p)
}
rm(list = c("ion", "p"))
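Because the matrices are rounded to one decimal place with round(1), a coefficient displayed as 1 is not necessarily an exact linear relationship. The sketch below shows this on made-up vectors; `x`, `noise` and `y` are hypothetical and unrelated to the dataset.

```r
# Made-up data: y follows x closely, but not exactly
x <- 1:10
noise <- 0.5 * (-1)^x        # alternating -0.5, +0.5, ...
y <- x + noise

r <- cor(x, y)               # Pearson correlation, cor()'s default method
r_displayed <- round(r, 1)   # what a matrix rounded to one decimal shows

r
r_displayed
```

Any coefficient of 0.95 or more is displayed as 1 at this precision, so rounding to two decimals would be needed to tell a very strong correlation from an exact one.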
To better understand the correlations between attributes I plot graphs for the variables with the strongest associations.
mutated_sample_X %>%
ggplot(aes(x=Gravimetric.Energy, y=Volumetric.Energy)) +
geom_point(aes(color=Working.Ion.Other)) +
geom_smooth(method = "gam") +
geom_rug(aes(color=Working.Ion.Other))
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
mutated_sample_X %>%
ggplot(aes(x=Gravimetric.Capacity, y=Volumetric.Capacity)) +
geom_point(aes(color=Working.Ion.Other)) +
geom_smooth(method = "gam") +
geom_rug(aes(color=Working.Ion.Other))
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
mutated_sample_X %>%
ggplot(aes(x=Gravimetric.Capacity, y=Atomic.Fraction.Discharge)) +
geom_point(aes(color=Working.Ion.Other)) +
geom_smooth(method = "gam") +
geom_rug(aes(color=Working.Ion.Other))
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
mutated_sample_X %>%
ggplot(aes(x=Volumetric.Capacity, y=Atomic.Fraction.Discharge)) +
geom_point(aes(color=Working.Ion.Other)) +
geom_smooth(method = "gam") +
geom_rug(aes(color=Working.Ion.Other))
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
mutated_sample_X %>%
ggplot(aes(x=Gravimetric.Energy, y=Average.Voltage)) +
geom_point(aes(color=Working.Ion.Other)) +
geom_smooth(method = "gam") +
geom_rug(aes(color=Working.Ion.Other))
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
To examine correlations between attributes for different battery types,
I use interactive plots that allow the presented observations to be
filtered by working ion.
The linear correlation between volumetric energy and
gravimetric energy is strong for all battery types. For
batteries with magnesium as the working ion, volumetric capacity
grows linearly with gravimetric capacity until it reaches
4000.
p <- mutated_sample_X %>%
ggplot(aes(x=Gravimetric.Energy, y=Volumetric.Energy)) +
geom_point(aes(color=Working.Ion.Other))
ggplotly(p)
p <- mutated_sample_X %>%
ggplot(aes(x=Gravimetric.Capacity, y=Volumetric.Capacity)) +
geom_point(aes(color=Working.Ion.Other))
ggplotly(p)
p <- mutated_sample_X %>%
ggplot(aes(x=Gravimetric.Capacity, y=Atomic.Fraction.Discharge)) +
geom_point(aes(color=Working.Ion.Other))
ggplotly(p)
p <- mutated_sample_X %>%
ggplot(aes(x=Volumetric.Capacity, y=Atomic.Fraction.Discharge)) +
geom_point(aes(color=Working.Ion.Other))
ggplotly(p)
p <- mutated_sample_X %>%
ggplot(aes(x=Gravimetric.Energy, y=Average.Voltage)) +
geom_point(aes(color=Working.Ion.Other))
ggplotly(p)
rm(list = c("p"))
On the 3D graph I plot 3 of the 5 most important attributes presented
in the attribute importance analysis section. However, I do not
observe any separate group on the plot.
plot_ly(mutated_sample_X, x=~Atomic.Fraction.Discharge, y=~Stability.Discharge, z=~Volumetric.Energy, type="scatter3d", color=~Working.Ion.Other)
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
Most of the observations presented in the dataset are for lithium batteries. There is also a focus on calcium, magnesium, sodium and zinc batteries. There are a few observations for other types of batteries, but for now these are marginal examples.
Due to the large presence of lithium batteries in the dataset, I try to create a classifier that predicts whether a battery is lithium or not.
li_X <- X %>%
mutate(Li = ifelse(Working.Ion %in% c("Li"), 'Yes', 'No' )) %>%
select(c(Average.Voltage:Stability.Discharge, Max.Voltage.Step, Li))
I split the data set into training and test data with a ratio of 9:1.
inTraining <-
createDataPartition(y=li_X$Li, p=0.9, list=FALSE)
X_Train <- li_X[inTraining,]
X_Test <- li_X[-inTraining,]
rm(list = c("inTraining"))
In training, I use the random forest method with repeated cross-validation: 2 folds, repeated 5 times.
ctrl <- trainControl(
method = "repeatedcv",
number = 2,
repeats = 5)
fit <- train(Li ~ .,
data = X_Train,
method = "rf",
trControl = ctrl,
ntree = 10)
fit
## Random Forest
##
## 3916 samples
## 10 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (2 fold, repeated 5 times)
## Summary of sample sizes: 1958, 1958, 1958, 1958, 1958, 1958, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8589888 0.7131421
## 6 0.8626149 0.7204872
## 10 0.8599591 0.7150745
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
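The "Summary of sample sizes" line in the output follows from the resampling scheme: with 2 folds, each model is fitted on one half of the 3916 training rows and evaluated on the other half. The sketch below reproduces the sizes with a plain random split in base R. It is only an illustration under that assumption and does not reproduce how caret assigns rows to folds.

```r
n_rows    <- 3916  # training-set size reported in the output
n_folds   <- 2
n_repeats <- 5

set.seed(379)
train_sizes <- unlist(lapply(seq_len(n_repeats), function(r) {
  # Randomly assign each row to one of the folds, in equal shares
  fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))
  # For each held-out fold, the model is fitted on the remaining rows
  sapply(seq_len(n_folds), function(k) sum(fold_id != k))
}))

train_sizes
```

This gives 10 fits in total (2 folds times 5 repetitions), each on 1958 rows. With so few folds every model sees only half of the training data, which keeps training fast but can make the accuracy estimate pessimistic.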
I get an accuracy of 0.8713 on the test data set.
rfClasses <- predict(fit, newdata = X_Test)
confusionMatrix(data = rfClasses, as.factor(X_Test$Li))
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 161 26
## Yes 30 218
##
## Accuracy : 0.8713
## 95% CI : (0.8361, 0.9013)
## No Information Rate : 0.5609
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7381
##
## Mcnemar's Test P-Value : 0.6885
##
## Sensitivity : 0.8429
## Specificity : 0.8934
## Pos Pred Value : 0.8610
## Neg Pred Value : 0.8790
## Prevalence : 0.4391
## Detection Rate : 0.3701
## Detection Prevalence : 0.4299
## Balanced Accuracy : 0.8682
##
## 'Positive' Class : No
##
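The headline statistics can be recomputed by hand from the four counts in the confusion matrix. The sketch below does this in base R, treating 'No' as the positive class as in the output; the variable names are only for illustration.

```r
# Counts copied from the confusion matrix (rows: prediction, columns: reference)
cm <- matrix(c(161, 30, 26, 218), nrow = 2,
             dimnames = list(Prediction = c("No", "Yes"),
                             Reference  = c("No", "Yes")))
total <- sum(cm)

# Share of all test rows that were classified correctly
accuracy <- sum(diag(cm)) / total

# 'No' is the positive class in the output
sensitivity <- cm["No", "No"] / sum(cm[, "No"])     # true 'No' found
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])  # true 'Yes' found

# Cohen's kappa: agreement beyond what the marginal totals give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / total^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = cohen_kappa), 4)
```

The rounded results agree with the Accuracy, Sensitivity, Specificity and Kappa lines of the output. Since the positive class is 'No', the sensitivity describes how well non-lithium batteries are recognised, and the specificity describes lithium batteries.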